Masader Form

Before submitting, please check that the dataset is not already listed in the catalogue: https://arbml.github.io/masader/

Email *

Name of the dataset *

For example, CALLHOME: Egyptian Arabic Speech Translation Corpus

Subsets

The subsets of the dataset, if it is broken down by dialect. Put each subset on a new line in the format: subset-name, number of samples, type [tokens, sentences, documents]. For example: Algerian, 2000, sentences
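The subset format described above can be checked before submitting. The following is a minimal sketch, not part of any Masader tooling: the names `Subset`, `ALLOWED_UNITS`, and `parse_subsets` are hypothetical, and the set of allowed types is assumed to be exactly the three listed in the field description.

```python
from typing import List, NamedTuple

# Assumption: the only valid types are the three named in the field description.
ALLOWED_UNITS = {"tokens", "sentences", "documents"}


class Subset(NamedTuple):
    name: str
    samples: int
    unit: str


def parse_subsets(text: str) -> List[Subset]:
    """Parse one subset per line: subset-name, number of samples, type."""
    subsets = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        # Split on commas and drop stray spaces such as "Algerian , 2000".
        parts = [part.strip() for part in line.split(",")]
        if len(parts) != 3:
            raise ValueError(f"line {line_number}: expected 3 comma-separated fields")
        name, samples, unit = parts
        if not samples.isdigit():
            raise ValueError(f"line {line_number}: sample count must be a whole number")
        if unit not in ALLOWED_UNITS:
            raise ValueError(f"line {line_number}: unknown type {unit!r}")
        subsets.append(Subset(name, int(samples), unit))
    return subsets
```

Calling `parse_subsets("Algerian , 2000, sentences")` returns a single `Subset` with the stray space removed from the name and the count converted to an integer.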

Link *

Direct link to the dataset repository

Huggingface Link

For example, https://huggingface.co/datasets/labr

License *

Use the short form of the license name, for example CC BY-SA 4.0

Year *

Year the dataset or paper was published

Language *

multilingual

Dialect *

Use "mixed" if the dataset contains multiple dialects

Domain *

Form *

text

spoken

sign

Collection Style *

crawling

crawling and annotation(translation)

crawling and annotation(other)

machine translation

human translation

manual curation

other

Description *

A brief description of the dataset

Volume *

The number of samples in the dataset. This is closely related to the Unit option. For example, if the dataset has 10K tweets, enter Volume: 10,000 and Unit: sentences. Please don't use 10K or any other abbreviation.
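As an illustration of the rule above, a Volume entry can be normalised to a plain integer while rejecting abbreviations. This is a minimal sketch under stated assumptions: `parse_volume` is a hypothetical helper, not part of the form or the catalogue, and it assumes commas are used only as thousands separators.

```python
import re


def parse_volume(raw: str) -> int:
    """Convert a Volume entry such as "10,000" to an integer.

    Abbreviations such as "10K" are rejected, as the form requests.
    """
    # Assumption: commas appear only as thousands separators.
    cleaned = raw.strip().replace(",", "")
    if not re.fullmatch(r"[0-9]+", cleaned):
        raise ValueError(f"Volume must be written out in full digits, got {raw!r}")
    return int(cleaned)
```

With this sketch, `parse_volume("10,000")` yields the integer 10000, while `parse_volume("10K")` raises a `ValueError`.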

Unit *

Tokens are usually used for NER, POS tagging, etc.; sentences for sentiment analysis; documents for text modelling tasks.

tokens

sentences

documents

hours

Ethical Risks

Social media datasets are considered medium risk because they might expose personal information. Datasets that might also contain hate speech are considered high risk.

Low

Medium

High

Provider

Name of the institution, e.g. NYU Abu Dhabi

Derived From

If the dataset is extracted or collected from another dataset, give the name of that dataset

Paper Title

Paper Link

Direct link to the PDF of the paper, e.g. https://arxiv.org/pdf/2110.06744.pdf

Script *

Arab

Latn

Arab-Latn

Tokenized *

Is the dataset tokenized? For example, الرجل = ال رجل

Yes

Host *

Where the data resides, e.g. GitHub, GitLab, Kaggle

Access *

Free

Upon-Request

With-Fee

Cost

For example 1750 $

Test split *

Does the dataset have a validation/test split?

Yes

Tasks *

If you choose "Other", separate multiple tasks with commas, e.g. sarcasm detection, abusive language detection

machine translation

speech recognition

sentiment analysis

language modelling

topic classification

dialect identification

text generation

cross-lingual information retrieval

named entity recognition

question answering

information retrieval

part of speech tagging

language identification

summarization

speaker identification

transliteration

morphological analysis

offensive language detection

review classification

gender identification

fake news detection

dependency parsing

irony detection

meter classification

natural language inference

Other:

Venue Title

Venue abbreviation, e.g. ACL

Citations

Number of citations

Venue Type

conference

workshop

journal

preprint

Venue Name

Full name, e.g. Association for Computational Linguistics

Authors

List all authors, separated by commas

Affiliations

Abstract

Abstract of the published paper

Added by *

Enter your full name in English

Notes
